This markdown file was created by Sunah Park for extended lab exercises from the book An Introduction to Statistical Learning with Applications in R (ISLR).
library(MASS)
library(ISLR)
library(ggplot2)

summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
head(Boston,3)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
Boston<-na.omit(Boston)
sum(is.na(Boston$medv))
## [1] 0
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
The lm() function is used to fit a simple linear regression model.
lm.fit<-lm(medv~lstat, data=Boston) # y: medv, x: lstat from the Boston data set
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
names(lm.fit) # Information stored in the linear model object
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit) # Coefficients of the linear model
## (Intercept) lstat
## 34.5538409 -0.9500494
confint(lm.fit) # Confidence intervals for the coefficient estimates
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
pred.lm.fit<-predict(lm.fit, data.frame(lstat=seq(1,50,5)), interval="confidence"); pred.lm.fit
## fit lwr upr
## 1 33.6037915 32.564024 34.643559
## 2 28.8535448 28.111211 29.595878
## 3 24.1032980 23.546025 24.660571
## 4 19.3530512 18.753385 19.952718
## 5 14.6028045 13.767222 15.438387
## 6 9.8525577 8.700886 11.004229
## 7 5.1023109 3.604296 6.600326
## 8 0.3520641 -1.505704 2.209832
## 9 -4.3981826 -6.622617 -2.173748
## 10 -9.1484294 -11.743514 -6.553345
pred.lm.fit2<-predict(lm.fit, data.frame(lstat=seq(1,50,5)), interval="prediction"); pred.lm.fit2
## fit lwr upr
## 1 33.6037915 21.347614 45.859969
## 2 28.8535448 16.619011 41.088079
## 3 24.1032980 11.878597 36.327999
## 4 19.3530512 7.126344 31.579758
## 5 14.6028045 2.362259 26.843350
## 6 9.8525577 -2.413620 22.118735
## 7 5.1023109 -7.201218 17.405839
## 8 0.3520641 -12.000428 12.704556
## 9 -4.3981826 -16.811114 8.014749
## 10 -9.1484294 -21.633109 3.336250
The fitted model for simple linear regression: \[ {medv}=34.55-0.95\times{lstat} \]
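As a quick sketch (our own illustration, not code from the book), the fitted coefficients can be used to compute a prediction by hand and checked against predict():

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

# Predict medv for lstat = 10 directly from the fitted coefficients
b <- coef(lm.fit)
manual <- unname(b[1] + b[2] * 10)  # 34.554 - 0.950 * 10
auto   <- unname(predict(lm.fit, data.frame(lstat = 10)))
c(manual = manual, auto = auto)     # both ~ 25.05
```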
ggplot(data=Boston, aes(x=lstat, y=medv))+geom_jitter(alpha=.7, color="grey")+stat_smooth(method="lm", level=0.95)+theme_bw(base_family="Avenir")

df<-data.frame(x=Boston$lstat,
               y=Boston$medv,
               y.fit=lm.fit$fitted.values,
               residuals=lm.fit$residuals,
               i=seq(1:length(Boston$lstat)))

When fitting a least squares line, we generally require the following conditions: the relationship is linear, the residuals are nearly normal, the variability of the residuals is nearly constant, and the observations are independent.
ggplot(df,aes(x=x,y=residuals))+geom_point(alpha=0.3)+geom_hline(yintercept=0, color="blue",linetype="solid",size=0.5)

Diagnosis: there is some evidence of non-linearity.
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnosis: the residuals are right-skewed.
ggplot(data=df, aes(x=y.fit,y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: the variability of the residuals is non-constant.
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: there is no increasing or decreasing pattern in the residuals; the model does not appear to suffer from time-series structure.
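The same four diagnostics can be obtained in one step with base R's plot() method for lm objects, a standard alternative to the ggplot2 versions above:

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

# Base R draws four diagnostic panels for any lm object:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))  # arrange the panels in a 2x2 grid
plot(lm.fit)
```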
# Model selection
lm.fit<-lm(medv~., data=Boston) # Full model where y: medv, x: all predictors in Boston
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# Model selection using p-value backward elimination: start with the full model, drop the variable with the highest p-value and refit a smaller model, then repeat until all variables left in the model are significant (p-value < alpha).
lm.fit<-lm(medv~.-age, data=Boston) # The variable age is removed first: it has the highest p-value, which exceeds the chosen significance level alpha = 0.05.
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
lm.fit<-lm(medv~.-age-indus, data=Boston) # The variables age and indus are now removed from the model
# Alternatively, the update() function can be used (e.g. update(lm.fit, ~.-age))
summary(lm.fit) # All remaining variables have p-values less than alpha; the final model has 11 predictors.
##
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
The final equation is: \[ {medv}=36.34-0.11\times{crim}+0.05\times{zn}+2.72\times{chas}-17.38\times{nox}+3.80\times{rm}-1.49\times{dis}+0.30\times{rad}-0.01\times{tax}-0.95\times{ptratio}+0.01\times{black}-0.52\times{lstat} \]
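The manual steps above can be automated; a minimal sketch of p-value backward elimination using update() (the alpha value and the loop are our own illustration, not code from the book):

```r
library(MASS)  # provides the Boston data set

alpha <- 0.05
fit <- lm(medv ~ ., data = Boston)  # start from the full model

repeat {
  p <- summary(fit)$coefficients[-1, 4]  # p-values, intercept excluded
  if (max(p) <= alpha) break             # stop: all remaining terms significant
  worst <- names(which.max(p))           # variable with the highest p-value
  fit <- update(fit, as.formula(paste("~ . -", worst)))  # refit without it
}
names(coef(fit))  # age and indus are dropped, leaving 11 predictors
```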
attributes(summary(lm.fit))
## $names
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
##
## $class
## [1] "summary.lm"
summary(lm.fit)$adj.r.squared
## [1] 0.7348058
The fitted model has an adjusted R-squared of 0.73: roughly 73% of the variability in the response medv is explained by the model, after adjusting for the number of predictors.
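For reference, the adjusted R-squared reported above follows from the ordinary R-squared by penalizing the number of predictors \(p\):

\[ \bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1} \]

With \(R^2 = 0.7406\), \(n = 506\), and \(p = 11\), this gives \(1 - 0.2594 \times 505/494 \approx 0.7348\), matching the output.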
df<-data.frame(y=Boston$medv,
               y.fit=lm.fit$fitted.values,
               residuals=lm.fit$residuals,
               abs.residuals=abs(lm.fit$residuals),
               i=seq(1:length(Boston$lstat)))

Multiple regression methods generally depend on the following four assumptions: the residuals are nearly normal, the variability of the residuals is nearly constant, the residuals are independent, and each predictor is linearly related to the outcome.
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnosis: there is a right skew and some outliers, but they are not extreme.
ggplot(data=df, aes(x=y.fit,y=abs.residuals))+geom_jitter(alpha=.5)

Diagnosis: three observations have absolute residuals above 20, but the variability is otherwise fairly constant.
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: we see no structure that indicates a time-series issue.
In the case of discrete variables: